Artificial Neural Network Optical Hardware Accelerator

ABSTRACT

The present disclosure advantageously provides an Optical Hardware Accelerator (OHA) for an Artificial Neural Network (ANN) that includes a communication bus interface, a memory, a controller, and an optical computing engine (OCE). The OCE is configured to execute an ANN model with ANN weights. Each ANN weight includes a quantized phase shift value θi and a phase shift value ϕi. The OCE includes a digital-to-optical (D/O) converter configured to generate input optical signals based on the input data, an optical neural network (ONN) configured to generate output optical signals based on the input optical signals, and an optical-to-digital (O/D) converter configured to generate the output data based on the output optical signals. The ONN includes a plurality of optical units (OUs), and each OU includes an optical multiply and accumulate (OMAC) module.

BACKGROUND

The present disclosure relates to computer systems. More particularly, the present disclosure relates to computer systems that include neural networks.

Artificial neural networks (ANNs), such as deep neural networks (DNNs), convolutional neural networks (CNNs), etc., are a popular solution to a wide array of challenging classification, recognition and regression problems. However, many ANNs require a large number of calculations involving a large number of weights and activations, which presents a significant challenge with respect to performance, access and storage, particularly for mobile and other power or storage-constrained devices.

An ANN hardware accelerator increases the speed of these calculations when compared to the central processor of a mobile device. ANN hardware accelerators may include one or more processors, coprocessors, matrix multiplier units, multiply-and-accumulate (MAC) arrays, etc. For example, a common approach to implementing the convolutional layers of a CNN is to convert the convolution operations into general matrix multiplication (GEMM) operations that are performed by the ANN hardware accelerator. However, these GEMM operations consume significant computing power due to the large number of multiplications required.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an ANN, in accordance with an embodiment of the present disclosure.

FIG. 2 depicts a CNN, in accordance with an embodiment of the present disclosure.

FIG. 3 depicts a block diagram of a system, in accordance with embodiments of the present disclosure.

FIG. 4 depicts a block diagram of an optical hardware accelerator for an ANN, in accordance with embodiments of the present disclosure.

FIG. 5 depicts a block diagram of an optical computing engine, in accordance with embodiments of the present disclosure.

FIG. 6A depicts a block diagram of an optical multiply and accumulate module and FIG. 6B depicts a block diagram of an optical multiply and accumulate element, in accordance with embodiments of the present disclosure.

FIG. 7 depicts a block diagram of a weight matrix, in accordance with embodiments of the present disclosure.

FIG. 8 depicts a block diagram of an optical hardware accelerator for an ANN, in accordance with an alternative embodiment of the present disclosure.

FIG. 9 depicts a flow diagram presenting functionality for accelerating an ANN using an optical hardware accelerator, in accordance with embodiments of the present disclosure.

DETAILED DESCRIPTION

Embodiments of the present disclosure will now be described with reference to the drawing figures, in which like reference numerals refer to like parts throughout.

Embodiments of the present disclosure advantageously provide an optical hardware accelerator (OHA) for an ANN that includes an optical computing engine (OCE) that is configured to execute an ANN model with ANN weights. The ANN weights are phase shift values that are determined and quantized during the training of the ANN model. Due to the physical properties of the optical medium and the quantized phase shift values, the ANN optical hardware accelerator advantageously provides much faster computation, significantly reduces power consumption, and reduces memory bandwidth when compared to conventional ANN hardware accelerators.

In one embodiment, an OHA for an ANN includes a communication bus interface, a memory coupled to the communication bus interface, a controller coupled to the communication bus interface and the memory, and an OCE coupled to the memory and the controller. The OCE is configured to execute at least a portion of an ANN model with ANN weights, each ANN weight including a quantized phase shift value θ_(i) and a phase shift value ϕ_(i). The OCE includes a digital-to-optical (D/O) converter configured to generate input optical signals based on the input data, an optical neural network (ONN) configured to generate output optical signals based on the input optical signals, and an optical-to-digital (O/D) converter configured to generate the output data based on the output optical signals.

The ONN includes a plurality of optical units (OUs). Each OU includes an optical multiply and accumulate (OMAC) module. Each OMAC module includes an array of OMAC elements, and each OMAC element includes a Mach-Zehnder Interferometer (MZI) and a single-node phase shifter. Each MZI is configured to apply a phase shift equal to the quantized phase shift value θ_(i) of a corresponding ANN weight to an optical signal. Each single-node phase shifter is configured to apply a phase shift equal to the phase shift value ϕ_(i) of the corresponding ANN weight to the optical signal.
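By way of non-limiting illustration, the quantization of a phase shift value may be sketched as follows. The uniform quantization over [0, 2π), the bit width, and the function name `quantize_phase` are assumptions made for this sketch only; the disclosure does not prescribe a particular quantization scheme.

```python
import math


def quantize_phase(theta, bits):
    """Map a phase shift (radians) to the nearest of 2**bits uniform levels.

    Assumption: levels are spaced uniformly over [0, 2*pi). Returns the
    integer level index and the quantized phase shift in radians.
    """
    levels = 2 ** bits
    step = 2 * math.pi / levels
    # Wrap the phase into [0, 2*pi), round to the nearest level, and wrap
    # the index so that a phase just below 2*pi maps back to level 0.
    index = round((theta % (2 * math.pi)) / step) % levels
    return index, index * step


# Hypothetical example: a trained phase of 1.0 rad stored as a 4-bit index.
index, theta_q = quantize_phase(1.0, bits=4)
print(index, theta_q)
```

In such a sketch, only the small integer index would need to be stored and transferred for each quantized phase shift, which is one plausible way the quantization could reduce memory bandwidth.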

An ANN models the relationships between input data or signals and output data or signals using a network of interconnected nodes that is trained through a learning process. The nodes are arranged into various layers, including, for example, an input layer, one or more hidden layers, and an output layer. The input layer receives input data, such as, for example, image data, and the output layer generates output data, such as, for example, a probability that the image data contains a known object. Each hidden layer provides at least a partial transformation of the input data to the output data. A DNN has multiple hidden layers in order to model complex, nonlinear relationships between input data and output data.

In a fully-connected, feedforward ANN, each node is connected to all of the nodes in the preceding layer, as well as to all of the nodes in the subsequent layer. For example, each input layer node is connected to each hidden layer node, each hidden layer node is connected to each input layer node and each output layer node, and each output layer node is connected to each hidden layer node. Additional hidden layers are similarly interconnected. Each connection has a weight value, and each node has an activation function, such as, for example, a linear function, a step function, a sigmoid function, a tanh function, a rectified linear unit (ReLU) function, etc., that determines the output of the node based on the weighted sum of the inputs to the node. The input data propagates from the input layer nodes, through respective connection weights to the hidden layer nodes, and then through respective connection weights to the output layer nodes.

More particularly, at each input node, input data is provided to the activation function for that node, and the output of the activation function is then provided as an input data value to each hidden layer node. At each hidden layer node, the input data value received from each input layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation signal value that is provided to the activation function for that node. The output of the activation function is then provided as an input data value to each output layer node. At each output layer node, the output data value received from each hidden layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation signal value that is provided to the activation function for that node. The output of the activation function is then provided as output data. Additional hidden layers may be similarly configured to process data.
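The multiply, accumulate and activate sequence described above may be sketched, for illustration only, as follows. The layer sizes, weight values, choice of activation functions, and the names `layer_forward`, `hidden_weights` and `output_weights` are hypothetical, and the input nodes are assumed to use an identity activation.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, activation):
    """Compute one fully-connected layer.

    weights[j][i] is the connection weight from input i to node j. Each node
    accumulates the weighted inputs and applies its activation function.
    """
    outputs = []
    for node_weights in weights:
        accumulated = sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(activation(accumulated))
    return outputs


# Hypothetical network: 3 input nodes, 2 hidden nodes, 1 output node.
hidden_weights = [[0.5, -1.0, 0.25], [1.0, 1.0, -0.5]]
output_weights = [[2.0, -1.0]]

x = [1.0, 2.0, 4.0]
hidden = layer_forward(x, hidden_weights, relu)
y = layer_forward(hidden, output_weights, sigmoid)
print(hidden, y)
```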

FIG. 1 depicts an ANN 10, in accordance with an embodiment of the present disclosure.

ANN 10 includes input layer 20, one or more hidden layers 30, 40, 50, etc., and output layer 60. Input layer 20 includes one or more input nodes 21, 22, 23, etc. Hidden layer 30 includes one or more hidden nodes 31, 32, 33, 34, 35, etc. Hidden layer 40 includes one or more hidden nodes 41, 42, 43, 44, 45, etc. Hidden layer 50 includes one or more hidden nodes 51, 52, 53, 54, 55, etc. Output layer 60 includes one or more output nodes 61, 62, etc. Generally, ANN 10 includes N hidden layers, input layer 20 includes “i” nodes, hidden layer 30 includes “j” nodes, hidden layer 40 includes “k” nodes, hidden layer 50 includes “m” nodes, and output layer 60 includes “o” nodes.

In one embodiment, N equals 3, i equals 3, j, k and m equal 5 and o equals 2 (depicted in FIG. 1). Input node 21 is coupled to hidden nodes 31 to 35, input node 22 is coupled to hidden nodes 31 to 35, and input node 23 is coupled to hidden nodes 31 to 35. Hidden node 31 is coupled to hidden nodes 41 to 45, hidden node 32 is coupled to hidden nodes 41 to 45, hidden node 33 is coupled to hidden nodes 41 to 45, hidden node 34 is coupled to hidden nodes 41 to 45, and hidden node 35 is coupled to hidden nodes 41 to 45. Hidden node 41 is coupled to hidden nodes 51 to 55, hidden node 42 is coupled to hidden nodes 51 to 55, hidden node 43 is coupled to hidden nodes 51 to 55, hidden node 44 is coupled to hidden nodes 51 to 55, and hidden node 45 is coupled to hidden nodes 51 to 55. Hidden node 51 is coupled to output nodes 61 and 62, hidden node 52 is coupled to output nodes 61 and 62, hidden node 53 is coupled to output nodes 61 and 62, hidden node 54 is coupled to output nodes 61 and 62, and hidden node 55 is coupled to output nodes 61 and 62.

Many other variations of input, hidden and output layers are clearly possible, including hidden layers that are locally-connected, rather than fully-connected, to one another.

Training an ANN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the ANN achieves a particular level of accuracy. One method is backpropagation, or backward propagation of errors, which iteratively and recursively determines the gradient of the prediction error with respect to the connection weights, and then adjusts the connection weights in the direction of gradient descent to improve the performance of the network. In other backpropagation methods, gradient descent is not needed.
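For illustration only, the iterative weight adjustment described above may be sketched for a single connection weight. The squared-error loss, the learning rate, the training samples and the name `train_weight` are assumptions chosen for this sketch; a full backpropagation implementation would repeat the same update for every connection weight, layer by layer.

```python
def train_weight(samples, weight=0.0, learning_rate=0.1, steps=100):
    """Fit y = weight * x by gradient descent on the mean squared error."""
    for _ in range(steps):
        # Gradient of mean((weight * x - target) ** 2) with respect to weight.
        gradient = sum(2.0 * (weight * x - t) * x for x, t in samples) / len(samples)
        # Move the weight against the gradient to reduce the prediction error.
        weight -= learning_rate * gradient
    return weight


# Hypothetical training data generated by the relationship y = 2 * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
trained = train_weight(samples)
print(trained)
```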

A multi-layer perceptron (MLP) is a fully-connected ANN that has an input layer, an output layer and one or more hidden layers. MLPs may be used for natural language processing applications, such as machine translation, speech recognition, etc. Other ANNs include recurrent neural networks (RNNs), long short-term memories (LSTMs), sequence-to-sequence models that include an encoder RNN and a decoder RNN, shallow neural networks, etc.

A CNN is a variation of an MLP that may be used for classification or recognition applications, such as image recognition, speech recognition, etc. A CNN has an input layer, an output layer and multiple hidden layers including convolutional layers, pooling layers, normalization layers, fully-connected layers, etc. Each convolutional layer applies a sliding dot product or cross-correlation to an input volume, applies an activation function to the results, and then provides the activation or output volume to the next layer. Convolutional layers typically use the ReLU function as the activation function. In certain embodiments, the activation function is provided in a separate activation layer, such as, for example, a ReLU layer. A pooling layer reduces the dimensions of the output volume received from the preceding convolutional layer, and may calculate an average or a maximum over small clusters of data, such as, for example, 2×2 matrices. In certain embodiments, a convolutional layer and a pooling layer may form a single layer of a CNN. The fully-connected layers follow the convolutional and pooling layers, and include a flatten layer and a classification layer, followed by a normalization layer that includes a normalization function, such as the SoftMax function. The output layer follows the last fully-connected layer; in certain embodiments, the output layer may include the normalization function.

FIG. 2 depicts a CNN 15, in accordance with an embodiment of the present disclosure. CNN 15 includes input layer 20, one or more hidden layers, such as convolutional layer 30-1, pooling layer 30-2, hidden (flatten) layer 40, hidden (classification) layer 50, etc., and output layer 60. Many other variations of input, hidden and output layers are contemplated.

Input layer 20 includes one or more input nodes 21, etc., that present the input data, such as a color image, as an input volume to the first convolutional layer, e.g., convolutional layer 30-1. The input volume is a three-dimensional matrix that has a width, a height and a depth. For example, input data that represent a color image are presented as an input volume that is 512 pixels×512 pixels×3 channels (red, green, blue); other input volume dimensions may also be used, such as 32×32×3, 64×64×3, 128×128×3, etc., 32×32×1, 64×64×1, 128×128×1, 512×512×1, etc.

Convolutional layer 30-1 is locally-connected to input layer 20, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). For a CNN that uses a standard convolution, each node computes a dot product between the node's weights and the respective local region of the input volume. An activation function is then applied to the results of each convolution calculation to produce an output volume that is provided as an input volume to the subsequent layer. The activation function may be applied by each convolutional layer node or by the nodes of a subsequent locally-connected ReLU layer.
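The per-node dot product and subsequent activation described above may be sketched, for illustration only, for a single-channel input and a single filter. The function name `conv2d_relu`, the unit stride, the absence of padding and the example values are assumptions made for this sketch.

```python
def conv2d_relu(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), take the dot
    product with each local region, and apply the ReLU activation."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            # Dot product between the node's weights and its local region.
            for i in range(kernel_h):
                for j in range(kernel_w):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(max(0.0, acc))  # ReLU
        feature_map.append(row)
    return feature_map


# Hypothetical 3x3 single-channel input and 2x2 filter.
image = [[1, 2, 0],
         [0, 1, 3],
         [2, 1, 0]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_relu(image, kernel)
print(feature_map)
```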

Pooling layer 30-2 is locally-connected to convolutional layer 30-1, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). Pooling layer 30-2 also produces an output volume that is provided as the input volume to the subsequent layer, such as, for example, another convolutional layer 30-1, a flatten layer 40, etc. In certain embodiments, convolutional layer 30-1 and pooling layer 30-2 form a single hidden layer 30. Similarly, in certain embodiments, convolutional layer 30-1, a ReLU layer and pooling layer 30-2 form a single hidden layer 30. Generally, the output volumes of the convolutional and pooling layers may be described as feature maps, and one or more single hidden layers 30 form a feature learning portion of CNN 15.

Hidden layer 40 is a “flatten” layer that is locally-connected to pooling layer 30-2, and includes one or more hidden (flatten) nodes 41, 42, 43, 44, 45, etc. Hidden (flatten) layer 40 “flattens” the output volume produced by the preceding pooling layer 30-2 into a column vector, which is provided to the subsequent, fully-connected hidden layer 50.

Hidden layer 50 is a classification layer that is fully-connected to hidden (flatten) layer 40, and includes one or more hidden (classification) nodes 51, 52, 53, 54, 55, etc.

Output layer 60 includes one or more output nodes 61, 62, etc., and is fully-connected to hidden (classification) layer 50. Fully-connected output layer 60 receives the classification results output by hidden (classification) layer 50, and each node outputs a predicted class score. A normalization function, such as a Softmax function, may be applied to the predicted class scores by output layer 60, or, alternatively, by an additional layer interposed between hidden (classification) layer 50 and output layer 60.
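By way of example only, a Softmax normalization of the predicted class scores may be sketched as follows. The function name `softmax` and the example scores are hypothetical; the subtraction of the maximum score is a common numerical safeguard and is not taken from the disclosure.

```python
import math


def softmax(scores):
    """Normalize class scores into probabilities that sum to one."""
    highest = max(scores)
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    exponentials = [math.exp(s - highest) for s in scores]
    total = sum(exponentials)
    return [e / total for e in exponentials]


# Hypothetical predicted class scores from two output nodes.
probabilities = softmax([2.0, 1.0])
print(probabilities)
```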

Similar to ANNs, training a CNN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the CNN achieves a particular level of accuracy. As noted above, backpropagation may be used to iteratively and recursively determine the gradient of the prediction error with respect to the connection weights, and then adjust the connection weights to improve the performance of the network. Matrix multiplication operations, and, more particularly, MAC operations, are used extensively by ANNs, CNNs, etc.

FIG. 3 depicts a block diagram of a system 100, in accordance with embodiments of the present disclosure.

System 100 includes communication bus 110 coupled to one or more processors 120, memory 130, I/O interfaces 140, display interface 150, one or more communication interfaces 160, and one or more OHAs 170. Generally, I/O interfaces 140 are coupled to I/O devices 142 using a wired or wireless connection, display interface 150 is coupled to display 152, and communication interface 160 is connected to network 162 using a wired or wireless connection. In many embodiments, certain components of system 100 are implemented as a system-on-chip (SoC) 102; in other embodiments, system 100 may be hosted on a traditional printed circuit board, motherboard, etc.

Communication bus 110 is a communication system that transfers data between processor 120, memory 130, I/O interfaces 140, display interface 150, communication interface 160, OHA 170, as well as other components not depicted in FIG. 3. Power connector 112 is coupled to communication bus 110 and a power supply (not shown). In certain embodiments, communication bus 110 is a network-on-chip (NoC).

Processor 120 includes one or more general-purpose or application-specific microprocessors that execute instructions to perform control, computation, input/output, etc. functions for system 100. Processor 120 may include a single integrated circuit, such as a micro-processing device, or multiple integrated circuit devices and/or circuit boards working in cooperation to accomplish the functions of processor 120. Additionally, processor 120 may include multiple processing cores, as depicted in FIG. 3. Generally, system 100 may include one or more processors 120, each containing one or more processing cores.

For example, system 100 may include 2 processors 120, each containing multiple processing cores. In certain embodiments, the processors 120 form a heterogeneous processing architecture, such as, for example, Arm's “big.LITTLE” architecture, that couples relatively battery-saving and slower processor cores (“LITTLE” cores) with relatively more powerful and power-hungry processing cores (“big” cores). For example, one processor 120 may be a high performance processor containing 4 “big” processing cores, e.g., Arm Cortex-A73, Cortex-A75, Cortex-A76, etc., while the other processor 120 may be a high efficiency processor containing 4 “LITTLE” processing cores, e.g., Arm Cortex-A53, Arm Cortex-A55, etc. In certain embodiments, processor 120 may also be configured to execute at least a portion of a classification-based machine learning model, such as, for example, an ANN, DNN, CNN, RNN, etc.

In addition, processor 120 may execute computer programs or modules, such as operating system 132, software modules 134, etc., stored within memory 130. For example, software modules 134 may include a machine learning (ML) application, an ANN application, a DNN application, a CNN application, an RNN application, etc.

Generally, storage element or memory 130 stores instructions for execution by processor 120 and data. Memory 130 may include a variety of non-transitory computer-readable media that may be accessed by processor 120. In various embodiments, memory 130 may include volatile and nonvolatile media, non-removable media and/or removable media. For example, memory 130 may include any combination of random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), read-only memory (ROM), flash memory, cache memory, and/or any other type of non-transitory computer-readable medium.

Memory 130 contains various components for retrieving, presenting, modifying, and storing data. For example, memory 130 stores software modules that provide functionality when executed by processor 120. The software modules include operating system 132 that provides operating system functionality for system 100. Software modules 134 provide various functionality, such as image classification using convolutional neural networks, etc. Data 136 may include data associated with operating system 132, software modules 134, etc.

I/O interfaces 140 are configured to transmit and/or receive data from I/O devices 142. I/O interfaces 140 enable connectivity between processor 120 and I/O devices 142 by encoding data to be sent from processor 120 to I/O devices 142, and decoding data received from I/O devices 142 for processor 120. Generally, data may be sent over wired and/or wireless connections. For example, I/O interfaces 140 may include one or more wired communications interfaces, such as USB, Ethernet, etc., and/or one or more wireless communications interfaces, coupled to one or more antennas, such as WiFi, Bluetooth, cellular, etc.

Generally, I/O devices 142 provide input to system 100 and/or output from system 100. As discussed above, I/O devices 142 are operably connected to system 100 using a wired and/or wireless connection. I/O devices 142 may include a local processor coupled to a communication interface that is configured to communicate with system 100 using the wired and/or wireless connection. For example, I/O devices 142 may include a keyboard, mouse, touch pad, joystick, etc.

Display interface 150 is configured to transmit image data from system 100 to monitor or display 152.

Communication interface 160 is configured to transmit data to and from network 162 using one or more wired and/or wireless connections. Network 162 may include one or more local area networks, wide area networks, the Internet, etc., which may execute various network protocols, such as, for example, wired and/or wireless Ethernet, Bluetooth, etc. Network 162 may also include various combinations of wired and/or wireless physical layers, such as, for example, copper wire or coaxial cable networks, fiber optic networks, Bluetooth wireless networks, WiFi wireless networks, CDMA, FDMA and TDMA cellular wireless networks, etc.

OHA 170 is configured to execute machine learning models, such as, for example, ANNs, DNNs, CNNs, RNNs, etc., in support of various applications embodied by software modules 134. Generally, OHA 170 includes one or more OCEs, as well as a controller, microcontroller, etc., a communications bus interface, and one or more non-volatile and/or volatile memories, such as, for example, ROM, flash memory, SRAM, DRAM, etc. The OCE implements at least a portion of an ANN model with ANN weights as an ONN, which generally includes silicon photonic circuits, digital-to-optical (D/O) converters, optical-to-digital (O/D) converters, etc. Generally, OHA 170 receives input data from memory 130 over communication bus 110, and transmits output data to memory 130 over communication bus 110.

FIG. 4 depicts a block diagram of an OHA 170 for an ANN, in accordance with embodiments of the present disclosure. OHA 170 includes controller 172, communication bus interface 174, memory 176, and one or more OCEs 180. Controller 172 is coupled to communication bus interface 174, memory 176 and OCE 180, and generally controls the components, functions, data flow, etc. of OHA 170. Communication bus interface 174 is coupled to communication bus 110 and memory 176, which is coupled to OCE 180.

In certain embodiments, a single OCE 180 executes the complete ANN model using the ANN weights, each of which includes a quantized phase shift value θ_(i) and a phase shift value ϕ_(i), as discussed in more detail below. In other embodiments, multiple OCEs 180 may be interconnected by a NoC using a ring topology, a star topology, a mesh topology, etc., or, alternatively, using a cross-bar switch, direct connections, etc. In these embodiments, each OCE 180 executes at least a portion of the ANN model using a portion of the ANN weights.

FIG. 5 depicts a block diagram of an OCE 180, in accordance with embodiments of the present disclosure.

OCE 180 includes controller 182, memory interface 184, D/O converter 186, O/D converter 188 and ONN 190. Controller 182 is coupled to memory interface 184, D/O converter 186, O/D converter 188 and ONN 190, as well as controller 172, and generally controls the components, functions, data flow, etc. of OCE 180. Memory interface 184 is coupled to memory 176, D/O converter 186 and O/D converter 188. D/O converter 186 is optically-coupled to ONN 190 via one or more optical fibers, optical channels, etc., and ONN 190 is optically-coupled to O/D converter 188 via one or more optical fibers, optical channels, etc.

ONN 190 includes an array of optical units (OUs) 192 that are optically-coupled, in a particular configuration, to implement an ANN model, a portion of an ANN model, etc. In many embodiments, ONN 190 implements an entire ANN model. In other embodiments, ONN 190 implements a portion of the ANN model, such as, for example, one or more layers of an ANN or CNN model, one or more convolutional layers of a CNN model, a portion of a convolutional layer of a CNN model, etc. Any remaining layers or portions of layers of the ANN model that are not implemented by ONN 190, such as, for example, one or more fully-connected layers of a CNN model, etc., may be implemented by a separate digital processor, such as, for example, processor 120, a dedicated coprocessor coupled to communication bus 110, a graphics processing unit (GPU) coupled to communication bus 110, etc. In the alternative embodiment depicted in FIG. 8 and discussed in more detail below, OHA 270 includes OCE 280 to implement certain layers of an ANN model, and digital computing engine (DCE) 278 to implement the remaining layers or layer portions of the ANN model.

Generally, the configuration of the array of OUs 192 will be determined by the architecture of the ANN model. FIG. 5 depicts an embodiment in which the OU array configuration has one or more rows, and each row includes one or more OUs 192. Input optical signals are provided by D/O converter 186 to the first OU 192 in each row, and output optical signals are provided by the last OU 192 in each row to O/D converter 188. In this embodiment, each OU 192 is optically-connected to a succeeding OU 192 in the same row (locally-connected). In other embodiments, each OU 192 is optically-coupled to a succeeding OU 192 in all of the rows (fully-connected). In further embodiments, a combination of locally-connected and fully-connected OUs 192 may be implemented. Each layer of the ANN model may include one or more OUs 192 that are locally-connected or fully-connected.

For example, an ANN model may include an input layer, one or more hidden layers and an output layer, and each OU 192 may be assigned to one of the ANN model layers. The input optical signals are provided to the OUs 192 in the input layer by D/O converter 186. The optical signals generated by the OUs 192 in the input layer are provided to the OUs 192 in a succeeding hidden layer. The optical signals generated by the OUs 192 in each hidden layer are provided to succeeding OUs 192 in the same hidden layer, the OUs 192 in a succeeding hidden layer or the OUs 192 in the output layer, depending upon the architecture of the ANN model. Finally, the optical signals generated by the OUs 192 in the output layer are provided to O/D converter 188.

In certain embodiments, the configuration of the array of OUs 192 supports the processing of a portion of a layer of an ANN, such as, for example, a portion of a convolutional layer of a CNN. In these embodiments, the array of OUs 192 provides many advantages over an array of MAC units, including much faster computation, significant reduction in power consumption, reduction in memory bandwidth, etc., as discussed above.

Each OU 192 includes an optical multiply and accumulate (OMAC) module 200. Generally, each OMAC module 200 is configured to apply a portion of the ANN weights, in the form of quantized phase shift values θ_(i) and phase shift values ϕ_(i), to optical signals input thereto, and then output optical signals to the next OU 192. If desired, activation function(s) may be applied to the digital signals output from O/D converter 188 using analog circuitry or digital processing (not shown for clarity) provided by OCE 180. Alternatively, the digital signals output from O/D converter 188 may be transferred to memory 176 and then provided to analog circuitry or digital processing (not shown for clarity) provided by OHA 170, or sent to processor 120, a dedicated coprocessor coupled to communication bus 110, a graphics processing unit (GPU) coupled to communication bus 110, etc. For example, in the alternative embodiment depicted in FIG. 8 and discussed in more detail below, OHA 270 may include OCE 280 to implement certain layers or portions of certain layers of an ANN model, and DCE 278 to apply the activation functions.

In certain embodiments, each OU 192 also includes an optical activation (OA) module 230 to receive the optical signals from the associated OMAC module 200. Each OA module 230 is configured to apply an activation function, in the form of a nonlinear phase shift, to the optical signals output from the associated OMAC module 200. Generally, OA module 230 is a nonlinear optical device, such as, for example, a bistable optical crystal, a saturable absorber, etc.

FIG. 6A depicts a block diagram of an OMAC module 200 and FIG. 6B depicts a block diagram of an OMAC element 220, in accordance with embodiments of the present disclosure.

Each OMAC module 200 includes an array of OMAC elements 220 that form a silicon photonic circuit. In the embodiment depicted in FIG. 6A, OMAC module 200 includes six (6) OMAC elements 220, i.e., OMAC elements 220-1, 220-2, 220-3, 220-4, 220-5 and 220-6, arranged in a triangular structure. In an alternative embodiment, OMAC elements 220 may be arranged in a rectangular structure. OMAC module 200 optically transforms four (4) input optical signals, i.e., optical signal 201 (a₁), optical signal 202 (a₂), optical signal 203 (a₃) and optical signal 204 (a₄), into four (4) output optical signals, i.e., optical signal 205 (y₁), optical signal 206 (y₂), optical signal 207 (y₃) and optical signal 208 (y₄). As discussed in more detail below, this optical transformation represents the matrix multiplication of Y=W*A, where Y is represented by output optical signals 205, 206, 207 and 208, W is represented by the quantized phase shift values θ_(i) and phase shift values ϕ_(i), and A is represented by input optical signals 201, 202, 203 and 204.

In this embodiment, OMAC module 200 includes six OMAC elements 220 that multiply a weight matrix (4×4) and an input vector (4×1) to generate an output vector (4×1). Different matrix dimensions may be multiplied by using different numbers of OMAC elements 220 according to the relationship N(N−1)/2, where N is the dimension of the N×N weight matrix and N(N−1)/2 is the number of OMAC elements 220. For example, multiplying a weight matrix (2×2) and an input vector (2×1) to generate an output vector (2×1) requires one OMAC element 220, multiplying a weight matrix (6×6) and an input vector (6×1) to generate an output vector (6×1) requires 15 OMAC elements 220 arranged in a triangular or rectangular structure, etc.
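As a brief illustration of the relationship N(N−1)/2, reading N as the dimension of the square weight matrix (the reading consistent with the 2×2, 4×4 and 6×6 examples given above), the element count may be computed as follows. The function name `omac_elements_required` is hypothetical.

```python
def omac_elements_required(n):
    """Number of two-port OMAC elements for an n x n weight matrix."""
    if n < 2:
        raise ValueError("a weight matrix of at least 2 x 2 is assumed")
    return n * (n - 1) // 2


for n in (2, 4, 6):
    print(n, omac_elements_required(n))
```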

With respect to FIG. 6A, generally, each OMAC element 220 receives a pair of optical signals and generates a pair of transformed optical signals. OMAC element 220 is a two port, silicon photonic device that is configured to apply two phase shifts corresponding to an ANN weight to input optical signal 212, i.e., a first phase shift equal to the quantized phase shift value θ_(i), and a second phase shift equal to the phase shift value ϕ_(i).

Optical signals 201 (a₁) and 202 (a₂) are input to OMAC element 220-1. Optical signal 203 (a₃) and transformed optical signal 202^(T) (output by OMAC element 220-1) are input to OMAC element 220-2. Optical signal 204 (a₄) and transformed optical signal 203^(T) (output by OMAC element 220-2) are input to OMAC element 220-3. Transformed optical signal 201^(T) (output by OMAC element 220-1) and transformed optical signal 202^(T) (output by OMAC element 220-2; not labeled for clarity) are input to OMAC element 220-4. Transformed optical signal 202^(T) (output by OMAC element 220-4) and transformed optical signal 203^(T) (output by OMAC element 220-3) are input to OMAC element 220-5. Transformed optical signal 201^(T) (output by OMAC element 220-4) and transformed optical signal 202^(T) (output by OMAC element 220-5; not labeled for clarity) are input to OMAC element 220-6. OMAC element 220-3 transforms optical signal 204 into output optical signal 208, OMAC element 220-5 transforms optical signal 203^(T) into output optical signal 207, and OMAC element 220-6 transforms optical signals 201^(T) and 202^(T) into output optical signals 205 and 206, respectively.
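The signal routing described above may be sketched numerically, for illustration only, by treating the four optical signals as complex amplitudes and applying the six OMAC elements in order. The two-port transform used here is that of Equation 1; the assignment of the ϕ phase to the lower-numbered waveguide of each pair, the example phase values, and the names `ELEMENT_PAIRS`, `apply_element` and `propagate` are assumptions of this sketch.

```python
import cmath
import math

# Waveguide pairs (0-based) for OMAC elements 220-1 to 220-6, read from the
# routing above: index 0 carries signal 201, index 3 carries signal 204.
ELEMENT_PAIRS = [(0, 1), (1, 2), (2, 3), (0, 1), (1, 2), (0, 1)]


def apply_element(signals, upper, lower, theta, phi):
    """Apply the two-port transform of Equation 1 to one pair of waveguides."""
    a, b = signals[upper], signals[lower]
    phase = cmath.exp(1j * phi)
    signals[upper] = phase * (math.cos(theta) * a - math.sin(theta) * b)
    signals[lower] = math.sin(theta) * a + math.cos(theta) * b


def propagate(inputs, weights):
    """Pass four input amplitudes through the six elements in sequence.

    weights holds one (theta, phi) pair per OMAC element.
    """
    if len(weights) != len(ELEMENT_PAIRS):
        raise ValueError("one (theta, phi) pair per OMAC element is required")
    signals = [complex(x) for x in inputs]
    for (upper, lower), (theta, phi) in zip(ELEMENT_PAIRS, weights):
        apply_element(signals, upper, lower, theta, phi)
    return signals


def total_power(signals):
    return sum(abs(s) ** 2 for s in signals)


# Hypothetical input amplitudes and per-element phase shifts (phi = 0 or pi).
a = [1.0, 0.5, -0.25, 2.0]
weights = [(0.3, 0.0), (1.2, math.pi), (0.7, 0.0),
           (2.1, math.pi), (0.4, 0.0), (1.5, 0.0)]
y = propagate(a, weights)
print(y, total_power(a), total_power(y))
```

Because every element applies a unitary two-port transform, the total optical power at the outputs matches that at the inputs in this idealized, lossless sketch.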

With respect to FIG. 6B, optical signal 211 is coupled to 3 dB (50%) beam splitter 222 and then 3 dB (50%) beam splitter 226, which transforms optical signal 211 into transformed optical signal 211^(T). Input optical signal 212 is coupled to beam splitter 222, phase shifter 224, beam splitter 226 and then phase shifter 228, which transforms input optical signal 212 into transformed optical signal 212^(T). Beam splitter 222, phase shifter 224 and beam splitter 226 form a Mach-Zehnder Interferometer (MZI).

The superposition of two amplitudes of coherent input waves performs an arbitrary U(2) transformation, as given by Equation 1, where θ is the phase shift applied by phase shifter 224, and ϕ is the phase shift applied by phase shifter 228. In many embodiments, ϕ=0 or π.

$\begin{matrix}{{T\left( {\theta,\phi} \right)} = \begin{pmatrix}{e^{i\;\phi}{\cos(\theta)}} & {{- e^{i\;\phi}}{\sin(\theta)}} \\{\sin(\theta)} & {\cos(\theta)}\end{pmatrix}} & {{Eq}.\mspace{14mu} 1}\end{matrix}$
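As a numerical check of Equation 1, the sketch below builds T(θ, ϕ) as a 2×2 complex matrix and forms the product of its conjugate transpose with itself, which should equal the identity for any θ and ϕ. The helper names `transfer_matrix` and `gram` are illustrative and are not taken from the disclosure.

```python
import cmath
import math


def transfer_matrix(theta, phi=0.0):
    """T(theta, phi) from Equation 1 as a nested list of complex numbers."""
    p = cmath.exp(1j * phi)
    return [
        [p * math.cos(theta), -p * math.sin(theta)],
        [complex(math.sin(theta)), complex(math.cos(theta))],
    ]


def gram(t):
    """Conjugate transpose of t multiplied by t; the identity if t is unitary."""
    return [
        [sum(t[k][i].conjugate() * t[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


# phi = pi is one of the two values singled out in the text.
for row in gram(transfer_matrix(0.7, math.pi)):
    print([round(abs(v), 12) for v in row])
```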

A matrix U is “unitary” if U^(†)U=I, where U^(†) is the conjugate transpose of U; for a real-valued matrix, this reduces to U^(T)U=I. Mathematically, any unitary matrix can be decomposed into the product of a series of U(2) matrices, as given by Equation 2.

$\begin{matrix}{U = {\prod_{i = 0}^{\frac{n{({n - 1})}}{2} - 1}{T_{i}(\theta)}}} & {{Eq}.\mspace{14mu} 2}\end{matrix}$

In certain embodiments, OMAC element 220 computes the product of a 4×4 unitary matrix and a 4-value vector. In other embodiments, OMAC element 220 may compute an arbitrary-size matrix multiplication Y=W*A. Matrix W (n×m) is pre-processed using a singular value decomposition (SVD) W=UΛV^(T), where U is an “n×n” unitary matrix, V is an “m×m” unitary matrix and Λ is an “n×m” diagonal matrix. Additionally, U and V may be decomposed into products of SU(2) matrices, as given by Equation 3, and implemented by OMAC element 220.

$\begin{matrix}{U = {\prod_{i = 0}^{\frac{n{({n - 1})}}{2} - 1}{T_{i}(\theta)}}} & {{Eq}.\mspace{14mu} 3}\end{matrix}$

For any “n×n” unitary matrix U, the corresponding phase shift values are given by Equation 4.

$\begin{matrix}{\left\{ \theta_{i} \right\} = \left\{ {\theta_{0},\theta_{1},\ldots\mspace{14mu},\theta_{\frac{n{({n - 1})}}{2} - 1}} \right\}} & {{Eq}.\mspace{14mu} 4}\end{matrix}$

Additionally, any diagonal element Λ_(i) of Λ may be represented as cos(θ_(i)) after a linear rescaling. In many embodiments, the matrix Λ may be ignored, and the U and V matrices may be programmed onto an appropriately-sized OMAC module 200, with corresponding phase shift values {θ_(i)}, to multiply matrices efficiently.

OMAC module 200 advantageously computes matrix multiplications of weights and activations in ML applications, such as ANNs. In certain situations, the phase shift values {θ_(i)} may require high precision encoding, such as, for example, 32-bit floating point values, fixed point values with very wide bit-widths, etc. Embodiments of the present disclosure advantageously retrain or finetune an ANN to quantize phase shift values into discrete angles, such as, for example, multiples of a quantum angle γ=360°/256, {θ_(i)}={γk_(i)|k_(i)∈Z₂₅₆}. In this example, the phase shift values are 8-bit integers. Other integer bit-widths are also contemplated, such as 2-bit integers, 4-bit integers, etc. In certain embodiments, the integer bit width may be greater than or equal to 2 bits and less than or equal to 16 bits.

Advantageously, quantized phase shift values reduce the memory required to store the ANN weights, which ranges from 1 MB to 100 MB or larger, as well as the power required to execute the ANN model. In an alternative embodiment, the quantization may be determined statistically without the need for ANN model training.
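The memory saving can be illustrated with simple arithmetic. The sketch below compares the bytes needed to store a set of phase shift values at 32 bits each against 8 bits each; the count of one million values is an arbitrary assumption chosen for illustration, and `phase_storage_bytes` is a hypothetical helper that assumes the values are packed with no per-value overhead.

```python
def phase_storage_bytes(num_values: int, bit_width: int) -> int:
    """Bytes needed to pack num_values phase shift values of bit_width bits each."""
    return (num_values * bit_width + 7) // 8


num_values = 1_000_000  # illustrative count, not taken from the disclosure
full = phase_storage_bytes(num_values, 32)
quantized = phase_storage_bytes(num_values, 8)
print(full, quantized, full / quantized)
```

Under these assumptions, moving from 32-bit to 8-bit phase shift values reduces storage by a factor of four.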

Generally, the phase quantization process may be described as follows. First, the ANN model is trained to determine unquantized ANN weights. Each unquantized ANN weight includes an unquantized phase shift value Θ_(i) and a phase shift value ϕ_(i). A weight matrix W is formed from the unquantized ANN weights. The weight matrix W is then decomposed into the direct sum of “n×n” tile matrices W_(i), as given by Equation 5 and depicted in FIG. 7, where n is an arbitrary constant that is chosen based on the size of the OMAC module 200, such as, for example, n=128. Each ANN weight tile includes a number of unquantized ANN weights.

W=⊕_(i)W_(i)  Eq. 5
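Read literally, the direct sum in Equation 5 places the tiles W_(i) along the diagonal of W with zeros elsewhere. The sketch below implements that literal reading for square tiles stored as nested lists; whether the tiling depicted in FIG. 7 matches this block-diagonal layout cannot be confirmed from the text alone, so this is an assumption, and `direct_sum` is an illustrative name.

```python
def direct_sum(tiles):
    """Block-diagonal matrix whose diagonal blocks are the given square tiles."""
    size = sum(len(t) for t in tiles)
    w = [[0.0] * size for _ in range(size)]
    offset = 0
    for t in tiles:
        n = len(t)
        for r in range(n):
            for c in range(n):
                w[offset + r][offset + c] = t[r][c]
        offset += n
    return w


# Two small tiles stand in for the n x n tiles of the text (e.g., n = 128).
for row in direct_sum([[[1.0, 2.0], [3.0, 4.0]], [[5.0]]]):
    print(row)
```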

Each ANN weight tile is quantized to create a quantized ANN weight tile. Each quantized ANN weight tile includes a number of quantized ANN weights. Each quantized ANN weight includes a quantized phase shift value θ_(i) and a phase shift value ϕ_(i). In certain embodiments, the phase quantizer is applied to each “n×n” tile matrix W_(i) by performing an SVD on W_(i), i.e., W_(i)=U_(i)Λ_(i)V_(i)^(T), where U_(i) and V_(i)^(T) are unitary matrices; a U(2) decomposition of U_(i) and V_(i)^(T) is then performed (as given by Equations 6 and 7), and phase shift θ_(i) is quantized (as given by Equations 8, 9 and 10), where m is the bit-width of quantization, and the bracket function, i.e., “[ ],” is the rounding function (e.g., [12.35]=12).

$\begin{matrix}{U_{i} = {\prod\limits_{k}^{\frac{n{({n - 1})}}{2}}{T_{i}\left( \theta_{k} \right)}}} & {{Eq}.\mspace{14mu} 6} \\{V_{i} = {\prod\limits_{k}^{\frac{n{({n - 1})}}{2}}{T_{i}\left( \theta_{k}^{\prime} \right)}}} & {{Eq}.\mspace{14mu} 7} \\{{Q\left( \theta_{k} \right)} = \left\lbrack \frac{\theta_{k}}{\gamma} \right\rbrack} & {{Eq}.\mspace{14mu} 8} \\{{Q\left( \theta_{k}^{\prime} \right)} = \left\lbrack \frac{\theta_{k}^{\prime}}{\gamma} \right\rbrack} & {{Eq}.\mspace{14mu} 9} \\{\gamma = \frac{2\pi}{2^{m}}} & {{Eq}.\mspace{14mu} 10}\end{matrix}$
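Equations 8 to 10 can be sketched as a small phase quantizer. Two details below are assumptions rather than statements from the disclosure: rounding is implemented as round-half-up (the text only gives the example [12.35]=12), and the integer result is wrapped modulo 2^m so that it always fits in m bits. The function names are illustrative.

```python
import math


def quantum_angle(m: int) -> float:
    """gamma = 2*pi / 2**m (Equation 10)."""
    return 2.0 * math.pi / (2 ** m)


def round_half_up(x: float) -> int:
    """Bracket function of the text, e.g. [12.35] = 12 (tie handling assumed)."""
    return math.floor(x + 0.5)


def quantize_phase(theta: float, m: int) -> int:
    """Q(theta) = [theta / gamma] (Equations 8 and 9), wrapped into m bits."""
    k = round_half_up(theta / quantum_angle(m))
    return k % (2 ** m)  # assumption: wrap so that 2*pi maps back to 0


print(quantize_phase(math.pi, 8))        # half a turn at 8 bits
print(quantize_phase(math.pi / 2.0, 4))  # quarter turn at 4 bits
```

The phase actually applied by a phase shifter would then be γ times the stored integer, which differs from the original phase by at most γ/2.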

Third, each tile W_(i) is approximated using the quantized phases Q(θ_(k)) and Q(θ′_(k)), as given by Equation 11.

$\begin{matrix}{{W_{i} \approx \overset{\_}{W_{i}}} = {\left( {\prod\limits_{k}^{\frac{n{({n - 1})}}{2}}{T_{i}\left( {Q\left( \theta_{k} \right)} \right)}} \right){\Lambda_{i}\left( {\prod\limits_{k}^{\frac{n{({n - 1})}}{2}}{T_{i}\left( {Q\left( \theta_{k}^{\prime} \right)} \right)}} \right)}}} & {{Eq}.\mspace{14mu} 11}\end{matrix}$
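The approximation in Equation 11 can be illustrated on the smallest possible tile, a 2×2 matrix built from one rotation angle on each side of a diagonal matrix. The sketch below assumes real rotations (no additional phase shift), applies the transpose to the right-hand factor as in the SVD form W_(i)=U_(i)Λ_(i)V_(i)^(T), and takes the applied phase to be γ times the quantized integer. All function names and numeric values are illustrative.

```python
import math


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]


def snap(theta, m):
    """Nearest multiple of gamma = 2*pi / 2**m, i.e. gamma * Q(theta)."""
    gamma = 2.0 * math.pi / (2 ** m)
    return gamma * math.floor(theta / gamma + 0.5)


def tile(theta_u, lam, theta_v):
    """2x2 tile U * diag(lam) * V^T with U and V as plane rotations."""
    diag = [[lam[0], 0.0], [0.0, lam[1]]]
    return matmul(matmul(rotation(theta_u), diag), transpose(rotation(theta_v)))


def max_abs_error(theta_u, lam, theta_v, m):
    """Largest entrywise gap between the exact tile and its quantized version."""
    exact = tile(theta_u, lam, theta_v)
    approx = tile(snap(theta_u, m), lam, snap(theta_v, m))
    return max(abs(exact[i][j] - approx[i][j])
               for i in range(2) for j in range(2))


for m in (4, 8, 12):
    print(m, max_abs_error(0.3, (2.0, 0.5), 1.1, m))
```

In this 2×2 setting, each entry of the error is bounded by 2γ times the largest diagonal value, so the worst-case error shrinks as the bit-width m grows.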

The quantized ANN weight tiles are formed into a quantized ANN weight matrix, and the quantized ANN matrix is formed into ANN weights. Finally, the approximation errors may be reduced or eliminated by ANN retraining. During retraining, the original weights are replaced by phase quantized weight Q_(p)(W) in the forward path computation, and, in the backward path computation, the gradient update to the weights is backpropagated. In certain embodiments, phase shift gradients are not generated, and, instead, the SVD of each “n×n” matrix tile is computed for each training step. Advantageously, through ANN training, the quantized phase may be efficiently encoded in the lowest bit-width.

Generally, there are many phase shift values ϕ_(i) in a weight tile. In certain embodiments, each weight in a weight tile includes a quantized phase shift value θ_(i) and a phase shift value ϕ_(i). In other embodiments, there may be fewer phase shift values ϕ_(i); for example, for a “64×64” weight tile (4,096 quantized phase shift values θ_(i)), there are 1,984 phase shift values ϕ_(i) (i.e., 64*(63−1)/2). In certain embodiments, one phase shift value ϕ_(i) may be either 0° or 180°, such as, for example, the last phase shift value ϕ_(N), while the remaining phase shift values ϕ_(i) are 0°. In these embodiments, the phase shift values ϕ_(i) may be encoded at near zero cost since only one phase shift value ϕ_(i) per weight tile, equal to 0° or 180°, needs to be encoded.

FIG. 8 depicts a block diagram of an OHA 270 for an ANN, in accordance with an alternative embodiment of the present disclosure.

OHA 270 includes controller 272, communication bus interface 274, memory 276, one or more DCEs 278 and one or more OCEs 280. Controller 272, communication bus interface 274, memory 276 and OCE 280 provide the same or similar functionality as controller 172, communication bus interface 174, memory 176, and OCE 180, respectively. Generally, OHA 270 receives input data from memory 130 over communication bus 110, and transmits output data to memory 130 over communication bus 110.

In one embodiment, OCE 280 executes the input, convolutional, activation and pooling layers of a CNN, while DCE 278 executes the fully-connected and output layers; other configurations are also contemplated. OCE 280 reads the input data from memory 276, and generates and stores intermediate output data in memory 276. DCE 278 reads the intermediate output data from memory 276, and generates and stores the output data in memory 276.

DCE 278 may include an interface to memory 276, local volatile or non-volatile memory, and one or more processors. In one embodiment, the processors execute the fully-connected and output layers of a CNN. The model and weights for the fully-connected and output layers of the CNN are stored in local non-volatile memory, or, alternatively, in memory 276. In this embodiment, a single DCE 278 executes the fully-connected and output layers of the CNN. Other embodiments execute other types of ANNs.

In further embodiments, multiple DCEs 278 may be interconnected by a NoC using a ring topology, a star topology, a mesh topology, etc. Alternatively, multiple DCEs 278 may be interconnected using a cross-bar switch, direct connections, etc.

FIG. 9 depicts a flow diagram 300 presenting functionality for accelerating an ANN using OHA 170, in accordance with embodiments of the present disclosure.

At 310, input data are received via communication bus interface 174.

At 320, 330, 340 and 350, the ANN model with ANN weights is executed by OCE 180. Each ANN weight includes a quantized phase shift value θ_(i) and a phase shift value ϕ_(i).

At 320, input optical signals are generated, by D/O converter 186, based on the input data.

Each OU 192 executes the functionality at 330 and 340. As discussed above, ONN 190 is configured to generate output optical signals based on the input optical signals and includes an array of OUs 192 that are optically-coupled, in a particular configuration, to implement a particular ANN model. Accordingly, one or more OUs 192 are arranged to implement each layer of the ANN model. Each OU 192 includes an OMAC module 200. In an alternative embodiment, each OU 192 also includes an associated OA module 230.

At 330, corresponding ANN weights are applied to optical signals by each OMAC module 200.

In an alternative embodiment, at 340, a nonlinear phase shift is applied, by each OA module 230, to the optical signals from the associated OMAC module 200.

At 350, output data are generated, by O/D converter 188, based on the output optical signals from ONN 190.

At 360, the output data are transmitted via communication bus interface 174.
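The flow at 320, 330 and 350 can be sketched as a chain of three functions for a single two-channel OMAC element. Everything about the signal model here is an assumption made for illustration: the D/O conversion is taken to encode each input value directly as a real amplitude, the OMAC element is modeled as a real rotation by θ, the O/D conversion is taken to report intensity (squared magnitude), and the optional nonlinear step at 340 is omitted. None of the function names come from the disclosure.

```python
import math


def d_to_o(data):
    """Step 320 (assumed model): encode each input value as a real amplitude."""
    return [float(x) for x in data]


def omac(signal, theta):
    """Step 330 (assumed model): one two-port element acting as a rotation."""
    c, s = math.cos(theta), math.sin(theta)
    a, b = signal
    return [c * a - s * b, s * a + c * b]


def o_to_d(signal):
    """Step 350 (assumed model): detect the intensity of each output channel."""
    return [abs(x) ** 2 for x in signal]


def run_inference(data, theta):
    """Steps 320 -> 330 -> 350 for a single element; 310 and 360 are bus I/O."""
    return o_to_d(omac(d_to_o(data), theta))


print(run_inference([1.0, 0.0], math.pi / 4.0))
```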

As discussed above, embodiments of the present disclosure advantageously provide an optical hardware accelerator (OHA) for an ANN that includes an optical computing engine (OCE) that is configured to execute an ANN model with ANN weights. The ANN weights are phase shift values that are determined and quantized during the training of the ANN model. Due to the physical properties of the optical medium and the quantized phase shift values, the ANN optical hardware accelerator advantageously provides much faster computation, significantly reduces power consumption, and reduces memory bandwidth when compared to conventional ANN hardware accelerators.

The embodiments described herein are combinable.

In one embodiment, an optical computing engine (OCE) is configured to execute at least a portion of an artificial neural network (ANN) model with ANN weights, each ANN weight including a quantized phase shift value θ_(i) and a phase shift value ϕ_(i). The OCE includes a digital-to-optical (D/O) converter configured to generate input optical signals based on input data; an optical neural network (ONN), configured to generate output optical signals based on the input optical signals, the ONN including a plurality of optical units (OUs), each OU including an optical multiply and accumulate (OMAC) module, each OMAC module including an array of OMAC elements, each OMAC element including a Mach-Zehnder Interferometer (MZI) and a single-node phase shifter, each MZI configured to apply a phase shift equal to the quantized phase shift value θ_(i) of a corresponding ANN weight to an optical signal, each single-node phase shifter configured to apply a phase shift equal to the phase shift value ϕ_(i) of the corresponding ANN weight to the optical signal; and an optical-to-digital (O/D) converter configured to generate output data based on the output optical signals.

In another embodiment of the OCE, the quantized phase shift value θ_(i) of each ANN weight is quantized to a bit-width that is greater than or equal to 2 bits and less than or equal to 16 bits; and a phase shift value ϕ_(i) of zero (0) represents 0 radians of phase shift, and a phase shift value ϕ_(i) of one (1) represents π radians of phase shift.

In another embodiment of the OCE, each quantized phase shift value θ_(i) is calculated by multiplying an unquantized phase shift value Θ_(i) by (2π)/(2⁸) and rounding the result to an 8-bit integer value.

In another embodiment of the OCE, each OU includes an optical activation (OA) module configured to apply a nonlinear phase shift to the optical signals from the OMAC module.

In one embodiment, an optical hardware accelerator for an artificial neural network (ANN) includes a communication bus interface configured to receive input data and transmit output data; a memory, coupled to the communication bus interface, configured to store the input data and the output data; a controller coupled to the communication bus interface and the memory; and an optical computing engine (OCE), coupled to the memory and the controller, configured to execute at least a portion of an ANN model with ANN weights, each ANN weight including a quantized phase shift value θ_(i) and a phase shift value ϕ_(i). The OCE includes a digital-to-optical (D/O) converter configured to generate input optical signals based on the input data, an optical neural network (ONN), configured to generate output optical signals based on the input optical signals, the ONN including a plurality of optical units (OUs), each OU including an optical multiply and accumulate (OMAC) module, each OMAC module including an array of OMAC elements, each OMAC element including a Mach-Zehnder Interferometer (MZI) and a single-node phase shifter, each MZI configured to apply a phase shift equal to the quantized phase shift value θ_(i) of a corresponding ANN weight to an optical signal, each single-node phase shifter configured to apply a phase shift equal to the phase shift value ϕ_(i) of the corresponding ANN weight to the optical signal, and an optical-to-digital (O/D) converter configured to generate the output data based on the output optical signals.

In another embodiment of the optical hardware accelerator, the quantized phase shift value θ_(i) of each ANN weight is quantized to a bit-width that is greater than or equal to 2 bits and less than or equal to 16 bits; and a phase shift value ϕ_(i) of zero (0) represents 0 radians of phase shift, and a phase shift value ϕ_(i) of one (1) represents π radians of phase shift.

In another embodiment of the optical hardware accelerator, each quantized phase shift value θ_(i) is calculated by multiplying an unquantized phase shift value Θ_(i) by (2π)/(2⁸) and rounding the result to an 8-bit integer value.

In another embodiment of the optical hardware accelerator, each OU includes an optical activation (OA) module configured to apply a nonlinear phase shift to the optical signals from the OMAC module.

In another embodiment of the optical hardware accelerator, the ANN model is a convolutional neural network (CNN) model including at least one convolutional layer, the portion of the ANN model is at least a portion of the convolutional layer, and the quantized phase shift value θ_(i) and the phase shift value ϕ_(i) of each ANN weight are determined during CNN model training.

In another embodiment of the optical hardware accelerator, the CNN model training includes back propagation of errors without gradient descent.

In another embodiment of the optical hardware accelerator, the ANN model includes an input layer, one or more hidden layers, and an output layer, and each OU is assigned to one of the ANN model layers; the input optical signals are provided to the OUs in the input layer; optical signals generated by the OUs in the input layer are provided to the OUs in a succeeding hidden layer; optical signals generated by the OUs in each hidden layer are provided to succeeding OUs in the same hidden layer, the OUs in a succeeding hidden layer or the OUs in the output layer; and optical signals generated by the OUs in the output layer are provided to the O/D converter.

In another embodiment of the optical hardware accelerator, at least one of the hidden layers is a fully-connected layer with digital weights, and the optical hardware accelerator further comprises a digital computing engine (DCE), coupled to the memory and the controller, configured to execute the fully-connected layer using the digital weights.

In one embodiment, a method for accelerating an artificial neural network (ANN) using an optical hardware accelerator includes receiving, via a communications bus interface, input data; executing, by an optical computing engine (OCE), at least a portion of an ANN model with ANN weights, each ANN weight including a quantized phase shift value θ_(i) and a phase shift value ϕ_(i), the OCE including a digital-to-optical (D/O) converter, an optical neural network (ONN) configured to generate output optical signals based on input optical signals, and an optical-to-digital (O/D) converter, the ONN including a plurality of optical units (OUs), each OU including an optical multiply and accumulate (OMAC) module, each OMAC module including an array of OMAC elements, each OMAC element including a Mach-Zehnder Interferometer (MZI) and a single-node phase shifter, the executing including: at the D/O converter, generating the input optical signals based on the input data, at each OMAC element of each OMAC module, applying, by the MZI, a phase shift equal to the quantized phase shift value θ_(i) of a corresponding ANN weight to an optical signal, and applying, by the single-node phase shifter, a phase shift equal to the phase shift value ϕ_(i) of the corresponding ANN weight to the optical signal, and at the O/D converter, generating output data based on the output optical signals; and transmitting, via the communications bus interface, the output data.

In another embodiment of the method, the quantized phase shift value θ_(i) of each ANN weight is quantized to a bit-width that is greater than or equal to 2 bits and less than or equal to 16 bits; and a phase shift value ϕ_(i) of zero (0) represents 0 radians of phase shift, and a phase shift value ϕ_(i) of one (1) represents π radians of phase shift.

In another embodiment of the method, each quantized phase shift value θ_(i) is calculated by multiplying an unquantized phase shift value Θ_(i) by (2π)/(2⁸) and rounding the result to an 8-bit integer value.

In another embodiment of the method, each OU includes an optical activation (OA) module, and the method further comprises at each OA module, applying a nonlinear phase shift to the optical signals from the OMAC module.

In another embodiment of the method, the ANN model is a convolutional neural network (CNN) model including at least one convolutional layer, the portion of the ANN model is at least a portion of the convolutional layer, and the quantized phase shift value θ_(i) and the phase shift value ϕ_(i) of each ANN weight are determined during CNN model training.

In another embodiment of the method, the CNN model training includes back propagation of errors without gradient descent.

In another embodiment of the method, the ANN model includes an input layer, one or more hidden layers, and an output layer, each OU is assigned to one of the ANN model layers, and the method further comprises providing the input optical signals to the OUs in the input layer; providing optical signals generated by the OUs in the input layer to the OUs in a succeeding hidden layer; providing optical signals generated by the OUs in each hidden layer to succeeding OUs in the same hidden layer, the OUs in a succeeding hidden layer or the OUs in the output layer; and providing optical signals generated by the OUs in the output layer to the O/D converter.

In another embodiment of the method, the CNN model includes at least one fully-connected layer with digital weights, and the method further comprises executing, by a digital computing engine (DCE), the fully-connected layer using the digital weights.

In one embodiment, a method for training an artificial neural network (ANN) for use with an optical hardware accelerator includes training an ANN model to determine unquantized ANN weights, each unquantized ANN weight including an unquantized phase shift value Θ_(i) and a phase shift value ϕ_(i); forming an ANN weight matrix from the unquantized ANN weights; decomposing the ANN weight matrix into a plurality of ANN weight tiles, each ANN weight tile including a number of unquantized ANN weights; quantizing each ANN weight tile to create a quantized ANN weight tile, each quantized ANN weight tile including a number of quantized ANN weights, each quantized ANN weight including a quantized phase shift value θ_(i) and a phase shift value ϕ_(i); forming the quantized ANN weight tiles into a quantized ANN weight matrix; forming the quantized ANN matrix into ANN weights; and retraining the ANN model based on the ANN weights.

In another embodiment of the training method, the quantized phase shift value θ_(i) of each ANN weight is quantized to a bit-width that is greater than or equal to 2 bits and less than or equal to 16 bits, a phase shift value ϕ_(i) of zero (0) represents 0 radians of phase shift, and a phase shift value ϕ_(i) of one (1) represents π radians of phase shift.

In another embodiment of the training method, each quantized phase shift value θ_(i) is calculated by multiplying an unquantized phase shift value Θ_(i) by (2π)/(2⁸) and rounding the result to an 8-bit integer value.

In another embodiment of the training method, the ANN model is a convolutional neural network (CNN) model including at least one convolutional layer, and the training includes back propagation of errors without gradient descent.

While implementations of the disclosure are susceptible to embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the disclosure and not intended to limit the disclosure to the specific embodiments shown and described. In the description above, like reference numerals may be used to describe the same, similar or corresponding parts in the several views of the drawings.

In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Reference throughout this document to “one embodiment,” “certain embodiments,” “an embodiment,” “implementation(s),” “aspect(s),” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.

The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive. Also, grammatical conjunctions are intended to express any and all disjunctive and conjunctive combinations of conjoined clauses, sentences, words, and the like, unless otherwise stated or clear from the context. Thus, the term “or” should generally be understood to mean “and/or” and so forth. References to items in the singular should be understood to include items in the plural, and vice versa, unless explicitly stated otherwise or clear from the text.

Recitation of ranges of values herein are not intended to be limiting, referring instead individually to any and all values falling within the range, unless otherwise indicated, and each separate value within such a range is incorporated into the specification as if it were individually recited herein. The words “about,” “approximately,” or the like, when accompanying a numerical value, are to be construed as indicating a deviation as would be appreciated by one of ordinary skill in the art to operate satisfactorily for an intended purpose. Ranges of values and/or numeric values are provided herein as examples only, and do not constitute a limitation on the scope of the described embodiments. The use of any and all examples, or exemplary language (“e.g.,” “such as,” “for example,” or the like) provided herein, is intended merely to better illuminate the embodiments and does not pose a limitation on the scope of the embodiments. No language in the specification should be construed as indicating any unclaimed element as essential to the practice of the embodiments.

For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The description is not to be considered as limited to the scope of the embodiments described herein.

In the following description, it is understood that terms such as “first,” “second,” “top,” “bottom,” “up,” “down,” “above,” “below,” and the like, are words of convenience and are not to be construed as limiting terms. Also, the terms apparatus, device, system, etc. may be used interchangeably in this text.

The many features and advantages of the disclosure are apparent from the detailed specification, and, thus, it is intended by the appended claims to cover all such features and advantages of the disclosure which fall within the scope of the disclosure. Further, since numerous modifications and variations will readily occur to those skilled in the art, it is not desired to limit the disclosure to the exact construction and operation illustrated and described, and, accordingly, all suitable modifications and equivalents may be resorted to that fall within the scope of the disclosure.

What is claimed is:
 1. An optical computing engine (OCE), configured toexecute at least a portion of an artificial neural network (ANN) modelwith ANN weights, each ANN weight including a quantized phase shiftvalue θ_(i) and a phase shift value ϕ_(i), the OCE comprising: adigital-to-optical (D/O) converter configured to generate input opticalsignals based on input data; an optical neural network (ONN), configuredto generate output optical signals based on the input optical signals,the ONN including a plurality of optical units (OUs), each OU includingan optical multiply and accumulate (OMAC) module, each OMAC moduleincluding an array of OMAC elements, each OMAC element including aMach-Zehnder Interferometer (MZI) and a single-node phase shifter, eachMZI configured to apply a phase shift equal to the quantized phase shiftvalue θ_(i) of a corresponding ANN weight to an optical signal, eachsingle-node phase shifter configured to apply a phase shift equal to thephase shift value ϕ_(i) of the corresponding ANN weight to the opticalsignal; and an optical-to-digital (O/D) converter configured to generateoutput data based on the output optical signals.
 2. The OCE of claim 1,where: the quantized phase shift value θ_(i) of each ANN weight isquantized to a bit-width that is greater than or equal to 2 bits andless than or equal to 16 bits; and a phase shift value ϕ_(i) of zero (0)represents 0 radians of phase shift, and a phase shift value ϕ_(i) ofone (1) represents π radians of phase shift.
 3. The OCE of claim 2,where each quantized phase shift value θ_(i) is calculated bymultiplying an unquantized phase shift value Θ_(i) by (2π)/(2⁸) androunding the result to an 8-bit integer value.
 4. The OCE of claim 1,where each OU includes an optical activation (OA) module configured toapply a nonlinear phase shift to the optical signals from the OMACmodule.
 5. An optical hardware accelerator for an artificial neuralnetwork (ANN), comprising: a communication bus interface configured toreceive input data and transmit output data; a memory, coupled to thecommunication bus interface, configured to store the input data and theoutput data; a controller coupled to the communication bus interface andthe memory; and an optical computing engine (OCE), coupled to the memoryand the controller, configured to execute at least a portion of an ANNmodel with ANN weights, each ANN weight including a quantized phaseshift value θ_(i) and a phase shift value ϕ_(i), the OCE including: adigital-to-optical (D/O) converter configured to generate input opticalsignals based on the input data, an optical neural network (ONN),configured to generate output optical signals based on the input opticalsignals, the ONN including a plurality of optical units (OUs), each OUincluding an optical multiply and accumulate (OMAC) module, each OMACmodule including an array of OMAC elements, each OMAC element includinga Mach-Zehnder Interferometer (MZI) and a single-node phase shifter,each MZI configured to apply a phase shift equal to the quantized phaseshift value θ_(i) of a corresponding ANN weight to an optical signal,each single-node phase shifter configured to apply a phase shift equalto the phase shift value ϕ_(i) of the corresponding ANN weight to theoptical signal, and an optical-to-digital (O/D) converter configured togenerate the output data based on the output optical signals.
 6. Theoptical hardware accelerator of claim 5, where: the quantized phaseshift value θ_(i) of each ANN weight is quantized to a bit-width that isgreater than or equal to 2 bits and less than or equal to 16 bits; and aphase shift value ϕ_(i) of zero (0) represents 0 radians of phase shift,and a phase shift value ϕ_(i) of one (1) represents π radians of phaseshift.
 7. The optical hardware accelerator of claim 6, where eachquantized phase shift value θ_(i) is calculated by multiplying anunquantized phase shift value Θ_(i) by (2π)/(2⁸) and rounding the resultto an 8-bit integer value.
 8. The optical hardware accelerator of claim5, where each OU includes an optical activation (OA) module configuredto apply a nonlinear phase shift to the optical signals from the OMACmodule.
 9. The optical hardware accelerator of claim 5, where the ANNmodel is a convolutional neural network (CNN) model including at leastone convolutional layer, the portion of the ANN model is at least aportion of the convolutional layer, and the quantized phase shift valueθ_(i) and the phase shift value ϕ_(i) of each ANN weight are determinedduring CNN model training.
10. The optical hardware accelerator of claim 9, where the CNN model training includes back propagation of errors without gradient descent.
11. The optical hardware accelerator of claim 5, where: the ANN model includes an input layer, one or more hidden layers, and an output layer, and each OU is assigned to one of the ANN model layers; the input optical signals are provided to the OUs in the input layer; optical signals generated by the OUs in the input layer are provided to the OUs in a succeeding hidden layer; optical signals generated by the OUs in each hidden layer are provided to succeeding OUs in the same hidden layer, the OUs in a succeeding hidden layer or the OUs in the output layer; and optical signals generated by the OUs in the output layer are provided to the O/D converter.
12. The optical hardware accelerator of claim 11, where at least one of the hidden layers is a fully-connected layer with digital weights, and the optical hardware accelerator further comprises a digital computing engine (DCE), coupled to the memory and the controller, configured to execute the fully-connected layer using the digital weights.
13. A method for accelerating an artificial neural network (ANN) using an optical hardware accelerator, comprising: receiving, via a communications bus interface, input data; executing, by an optical computing engine (OCE), at least a portion of an ANN model with ANN weights, each ANN weight including a quantized phase shift value θ_(i) and a phase shift value ϕ_(i), the OCE including a digital-to-optical (D/O) converter, an optical neural network (ONN) configured to generate output optical signals based on input optical signals, and an optical-to-digital (O/D) converter, the ONN including a plurality of optical units (OUs), each OU including an optical multiply and accumulate (OMAC) module, each OMAC module including an array of OMAC elements, each OMAC element including a Mach-Zehnder Interferometer (MZI) and a single-node phase shifter, the executing including: at the D/O converter, generating the input optical signals based on the input data, at each OMAC element of each OMAC module, applying, by the MZI, a phase shift equal to the quantized phase shift value θ_(i) of a corresponding ANN weight to an optical signal, and applying, by the single-node phase shifter, a phase shift equal to the phase shift value ϕ_(i) of the corresponding ANN weight to the optical signal, and at the O/D converter, generating output data based on the output optical signals; and transmitting, via the communications bus interface, the output data.
14. The method of claim 13, where: the quantized phase shift value θ_(i) of each ANN weight is quantized to a bit-width that is greater than or equal to 2 bits and less than or equal to 16 bits; and a phase shift value ϕ_(i) of zero (0) represents 0 radians of phase shift, and a phase shift value ϕ_(i) of one (1) represents π radians of phase shift.
15. The method of claim 14, where each quantized phase shift value θ_(i) is calculated by multiplying an unquantized phase shift value Θ_(i) by (2π)/(2⁸) and rounding the result to an 8-bit integer value.
16. The method of claim 13, where each OU includes an optical activation (OA) module, and the method further comprises: at each OA module, applying a nonlinear phase shift to the optical signals from the OMAC module.
17. The method of claim 13, where the ANN model is a convolutional neural network (CNN) model including at least one convolutional layer, the portion of the ANN model is at least a portion of the convolutional layer, and the quantized phase shift value θ_(i) and the phase shift value ϕ_(i) of each ANN weight are determined during CNN model training.
18. The method of claim 17, where the CNN model training includes back propagation of errors without gradient descent.
19. The method of claim 13, where the ANN model includes an input layer, one or more hidden layers, and an output layer, each OU is assigned to one of the ANN model layers, and the method further comprises: providing the input optical signals to the OUs in the input layer; providing optical signals generated by the OUs in the input layer to the OUs in a succeeding hidden layer; providing optical signals generated by the OUs in each hidden layer to succeeding OUs in the same hidden layer, the OUs in a succeeding hidden layer or the OUs in the output layer; and providing optical signals generated by the OUs in the output layer to the O/D converter.
20. The method of claim 17, where the CNN model includes at least one fully-connected layer with digital weights, and the method further comprises: executing, by a digital computing engine (DCE), the fully-connected layer using the digital weights.
21. A method for training an artificial neural network (ANN) for use with an optical hardware accelerator, comprising: training an ANN model to determine unquantized ANN weights, each unquantized ANN weight including an unquantized phase shift value Θ_(i) and a phase shift value ϕ_(i); forming an ANN weight matrix from the unquantized ANN weights; decomposing the ANN weight matrix into a plurality of ANN weight tiles, each ANN weight tile including a number of unquantized ANN weights; quantizing each ANN weight tile to create a quantized ANN weight tile, each quantized ANN weight tile including a number of quantized ANN weights, each quantized ANN weight including a quantized phase shift value θ_(i) and a phase shift value ϕ_(i); forming the quantized ANN weight tiles into a quantized ANN weight matrix; forming the quantized ANN matrix into ANN weights; and retraining the ANN model based on the ANN weights.
22. The method of claim 21, where the quantized phase shift value θ_(i) of each ANN weight is quantized to a bit-width that is greater than or equal to 2 bits and less than or equal to 16 bits, a phase shift value ϕ_(i) of zero (0) represents 0 radians of phase shift, and a phase shift value ϕ_(i) of one (1) represents π radians of phase shift.
23. The method of claim 22, where each quantized phase shift value θ_(i) is calculated by multiplying an unquantized phase shift value Θ_(i) by (2π)/(2⁸) and rounding the result to an 8-bit integer value.
24. The method of claim 21, where the ANN model is a convolutional neural network (CNN) model including at least one convolutional layer, and the training includes back propagation of errors without gradient descent.
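By way of illustration only, the tile-wise quantization recited in claims 21 to 23 might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: it assumes each unquantized phase shift value Θ_(i) is expressed in radians and is mapped to the nearest of 2⁸ levels spaced (2π)/(2⁸) apart, which is one possible reading of the (2π)/(2⁸) factor in the claims; it assumes a square weight matrix whose size is a multiple of the tile size; and all function names (quantize_theta, phi_to_radians, split_into_tiles, merge_tiles, quantize_weight_matrix) are hypothetical. The training and retraining steps are not shown.

```python
import math

BITS = 8                        # bit-width used in claims 7, 15 and 23
LEVELS = 2 ** BITS              # 256 quantization levels
STEP = (2 * math.pi) / LEVELS   # the (2*pi)/(2**8) factor from the claims


def quantize_theta(theta_rad):
    """Map a phase in radians to an 8-bit level index.

    ASSUMPTION: theta is in radians and levels are STEP apart; the claims
    do not state the units of the unquantized value.
    """
    wrapped = theta_rad % (2 * math.pi)        # phase is periodic in 2*pi
    return int(round(wrapped / STEP)) % LEVELS  # level 256 wraps to level 0


def phi_to_radians(phi_bit):
    """Single-node phase shifter value: 0 -> 0 radians, 1 -> pi radians."""
    if phi_bit not in (0, 1):
        raise ValueError("phi must be 0 or 1")
    return phi_bit * math.pi


def split_into_tiles(matrix, tile):
    """Decompose an n x n matrix (list of rows) into tile x tile blocks."""
    n = len(matrix)
    return {
        (r, c): [row[c:c + tile] for row in matrix[r:r + tile]]
        for r in range(0, n, tile)
        for c in range(0, n, tile)
    }


def merge_tiles(tiles, n):
    """Reassemble blocks keyed by their top-left (row, col) into a matrix."""
    out = [[None] * n for _ in range(n)]
    for (r, c), block in tiles.items():
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                out[r + i][c + j] = value
    return out


def quantize_weight_matrix(weights, tile):
    """Quantize a matrix of (theta_rad, phi_bit) pairs tile by tile."""
    tiles = split_into_tiles(weights, tile)
    quantized = {
        key: [[(quantize_theta(t), p) for (t, p) in row] for row in block]
        for key, block in tiles.items()
    }
    return merge_tiles(quantized, len(weights))
```

For example, calling quantize_weight_matrix on a 2 x 2 matrix of (Θ_(i), ϕ_(i)) pairs with a tile size of 1 or 2 returns a matrix of the same shape in which each Θ_(i) has been replaced by an integer level between 0 and 255 and each ϕ_(i) is carried through unchanged.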